Today we’ll be working with the diamonds dataset from the ggplot2 package. We want to understand how various features of the diamond influence its price.
Let’s load the ggplot2 package and the diamonds dataset. (Install the package with install.packages("ggplot2") if you have not done so yet.) Look at the documentation to understand what the dataset is about.
library(ggplot2)
data(diamonds)
?diamonds
As usual, we can use str(), head() or View() to see the dataset:
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Let’s practice some of the basic plotting that we learnt last session! (Note: Some of these plots may take a while to load as our dataset is quite big.)
Make a histogram of price. Vary the number of bins to see what happens.
Make a scatterplot of price vs. carat. Adjust the alpha to 0.05 to reduce overplotting. Do you see any patterns in the data?
Make a boxplot of price for each value of cut, then make a violin plot instead. How do these plots differ in the information that they give the reader?
Bar plots are useful in describing how often each category appears for a categorical variable. The code below makes a bar plot to show how many diamonds there are for each cut type:
ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()
Note that for shorter syntax, we can drop data = in ggplot() if our dataset is the first argument within the parentheses. We can also drop mapping = if (i) it is the second argument within the parentheses for ggplot(), or (ii) it is the first argument within the parentheses for the geom_xx() functions. For example, the code below will give the same plot:
ggplot(diamonds, aes(x = cut)) +
  geom_bar()
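Case (ii) above can be sketched as follows: the mapping moves into geom_bar(), where it is the first argument. The names p_geom and bar_counts are just illustrative, and layer_data() is used here only to read back the bar heights so that they can be compared with a plain frequency table.

```r
library(ggplot2)

# Mapping supplied inside the geom instead of inside ggplot()
p_geom <- ggplot(diamonds) +
  geom_bar(aes(x = cut))

# Bar heights computed by the plot, one per level of cut
bar_counts <- layer_data(p_geom, 1)$count

# These should agree with a simple frequency table
table(diamonds$cut)
```

Printing p_geom should give the same bar plot as the two versions above.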
To make the bars horizontal instead, we can add coord_flip():
ggplot(diamonds, aes(x = cut)) +
  geom_bar() +
  coord_flip()
Layering allows us to make more sophisticated and informative plots. Let’s go back to the scatterplot of price vs. carat:
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05)
There certainly seems to be a positive relationship between the two, even though there seems to be a lot of noise too. We can add a geom_smooth() layer that tries to smooth out the noise:
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
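The message shows which smoother geom_smooth() picked for us. As a rough sketch, we can also choose the method ourselves: the first plot below fits a straight line, and the second spells out the same smoother that the message reports (this relies on the mgcv package, which normally ships with R). The names p_lm and p_gam are just illustrative.

```r
library(ggplot2)

# Straight-line fit instead of the automatically chosen smoother
p_lm <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", formula = y ~ x)

# Spelling out the smoother named in the message above
p_gam <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"))
```

Stating the method and formula explicitly should also stop the message from being printed.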
On average, the heavier the diamond, the more expensive it is. At the same time, we see quite a wide spread of prices for diamonds of the same weight, indicating that there are probably other factors at play.
Let’s go back to the boxplot of price for each value of cut:
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
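For comparison, here is one possible version of the violin plot from the warm-up. Overlaying a narrow boxplot is just one design choice among many: the violin shows the shape of each distribution, while the box keeps the median and quartiles visible. The name p_violin is illustrative.

```r
library(ggplot2)

p_violin <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_violin() +
  # Narrow boxplot on top; outlier points are hidden to reduce clutter
  geom_boxplot(width = 0.1, outlier.shape = NA)

p_violin
```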
It seems counterintuitive that the cut of a diamond appears to have little effect on its price, and that diamonds of ideal cut tend to have lower prices. Could there be other factors at work? One possibility is that there just aren’t many large diamonds of ideal cut: thus, a diamond of ideal cut tends to weigh less (smaller in carat size), and hence fetches a lower price.
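Before turning to plots, a quick numerical check of this theory is possible. The sketch below uses base R's tapply() to compute the median carat and median price for each cut; the names carat_by_cut and price_by_cut are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Median weight of the diamonds in each cut category
carat_by_cut <- tapply(diamonds$carat, diamonds$cut, median)
carat_by_cut

# Median price in each cut category, for comparison
price_by_cut <- tapply(diamonds$price, diamonds$cut, median)
price_by_cut
```

If the theory holds, the better cuts should show a smaller median carat than the worse ones.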
We can explore this theory by modifying other aesthetics. For example, in the scatterplot of price vs. carat, we can let the color of each dot signify its cut:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2)
There seem to be more yellow dots on top and more purple dots below, lending credence to the intuitive assumption that, for a given carat, a better cut fetches a higher price. In this case, changing the color of the dots helped us to understand the data better.
The colors here are the ggplot2 defaults. We can introduce our own color scale with scale_color_brewer() to make the plot more informative (the full list of color palettes can be found through a Google image search):
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_color_brewer(palette = "YlOrRd")
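The palette names can also be looked up from within R. This is a small sketch that assumes the RColorBrewer package is installed (it is normally pulled in alongside ggplot2); seq_palettes is an illustrative name.

```r
library(RColorBrewer)

# Table of all palettes: maximum number of colors, category, colorblind-friendliness
brewer.pal.info

# Names of the sequential palettes, the family that "YlOrRd" belongs to
seq_palettes <- rownames(brewer.pal.info[brewer.pal.info$category == "seq", ])
seq_palettes

# Draw the sequential palettes as color swatches
display.brewer.all(type = "seq")
```

Any of these names can be passed as the palette argument of scale_color_brewer().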
There’s still a fair amount of overplotting going on. Can we have separate graphs of price vs. carat for each cut?
This is called splitting the plot into facets. R allows us to do this by using the function facet_wrap(). Use the following code to facet the plot by a single variable:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  facet_wrap(~ cut)
By default, R puts just 3 subplots in each row. We can change this by adding an nrow argument to facet_wrap():
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  facet_wrap(~ cut, nrow = 1)
Faceting didn’t help too much in this case, since the plots for the better cuts look very similar to one another. Perhaps we could add a smoothing layer to the original plot:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As you can probably see, the possibilities are endless! You can try plotting different variables against each other and see if you get anything interesting.
If we want to facet by more than 1 variable, we can do so with facet_grid(). The variable before the ~ sign will be split on the rows, while the variable after the ~ sign will be split on the columns:
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  facet_grid(cut ~ color)
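If the formula is hard to remember, facet_grid() also accepts the row and column variables by name. The sketch below assumes ggplot2 version 3.0.0 or later, where the rows and cols arguments together with vars() are available; p_grid is an illustrative name.

```r
library(ggplot2)

p_grid <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  # Same layout as cut ~ color: cut on the rows, color on the columns
  facet_grid(rows = vars(cut), cols = vars(color))

p_grid
```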
Let’s say you’re satisfied with the scatterplot of price vs. carat with color denoting cut, and that you want to share it with others. The first thing you should do is label your axes and give your plot a title:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")
The size of the labels seems a bit small. We can adjust them using the theme() function. Let’s center the plot title at the same time:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
  theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
        axis.title = element_text(size = rel(1.2)))
We can move the legend around by setting a legend.position argument in theme() (possible options are "none", "left", "right", "bottom", "top"):
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
  theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
        axis.title = element_text(size = rel(1.5)),
        legend.position = "bottom")
Just about everything in the plot can be modified. For a full (long!) list of attributes which can be modified, see this reference.
We can also try changing the overall theme of the plot and see if any of them make the visualization more effective:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
  theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
        axis.title = element_text(size = rel(1.5)),
        legend.position = "bottom") +
  theme_bw()
Notice how the legend is not at the bottom and that the plot and axis titles are back to the defaults? This is because we applied theme_bw() last. When we apply theme_bw(), it overwrites all the changes to the theme that we specified in theme(). To avoid this overwrite, we can simply reorder the code:
ggplot(diamonds, aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price") +
  theme_bw() +
  theme(plot.title = element_text(size = rel(2), face = "bold", hjust = 0.5),
        axis.title = element_text(size = rel(1.5)),
        legend.position = "bottom")
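The effect of the ordering can also be checked without drawing anything. As a rough sketch, theme objects can be added to each other directly, and the resulting legend.position setting can then be inspected; the names custom, overwritten and kept are illustrative.

```r
library(ggplot2)

custom <- theme(legend.position = "bottom")

# Complete theme applied last: our setting is replaced
overwritten <- custom + theme_bw()

# Complete theme applied first: our setting survives
kept <- theme_bw() + custom

overwritten$legend.position  # no longer "bottom"
kept$legend.position
```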
For a list of complete themes, see this link.
It seems tedious to be changing these attributes for each graph we make. The nice thing about ggplot2 is that it lets us store parts of a plot in variables! For example, we could have built a plot like the one above using this code:
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price, col = cut)) +
  geom_point(alpha = 0.2) +
  scale_colour_brewer(palette = "YlOrRd") +
  labs(x = "Carat", y = "Price", title = "Plot of carat vs. price")
th <- theme(plot.title = element_text(size = rel(1.5), face = "bold", hjust = 0.5),
            axis.title = element_text(size = rel(1.2)),
            legend.position = "bottom")
p  # plot without the theme changes
p + th
I can now apply these adjustments to any plot I want by adding + th at the end of the code:
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  labs(title = "Histogram of price", x = "Price", y = "Count") +
  th
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
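The message is a nudge to choose the bins ourselves. A minimal sketch of the two usual ways to do so is below; the bin width of 500 and the bin count of 50 are arbitrary choices for illustration, and p_binwidth and p_bins are illustrative names.

```r
library(ggplot2)

# Fix the width of each bin (here in units of price)
p_binwidth <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  labs(title = "Histogram of price", x = "Price", y = "Count")

# Or fix the number of bins instead
p_bins <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 50) +
  labs(title = "Histogram of price", x = "Price", y = "Count")
```

Either setting should make the message go away, and trying a few values shows how much the shape of the histogram depends on the binning.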
The ColorBrewer color scale functions in ggplot2 are of the form scale_x_y, where x is either color or fill, and y is either brewer or distiller. color is for the outline while fill is for the interior; brewer is for a discrete set of colors while distiller is for continuous scales. For example, if we wanted to color the points based on price, we would use scale_color_distiller():
ggplot(diamonds, aes(x = carat, y = price, col = price)) +
  geom_point(alpha = 0.2) +
  scale_color_distiller(palette = "YlOrRd")
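To round out the naming scheme, here is a sketch of the fill and brewer combination: the bars of the earlier bar plot are filled according to cut, a discrete variable, so scale_fill_brewer() is the matching function. The name p_fill is illustrative.

```r
library(ggplot2)

p_fill <- ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  # fill: interior of the bars; brewer: discrete set of colors
  scale_fill_brewer(palette = "YlOrRd")

p_fill
```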
Some of you might have picked up on an inconsistency here: didn’t I say that scale_color_distiller() is for the outline of the shape? Why then is the fill of the points in the plot above changing?
This has to do with the shape aesthetic. R has 26 built-in shapes, numbered 0 to 25.
Shapes 0-20 only have the col attribute while shapes 21-25 have both col and fill attributes. We can see this in action when we change the shape aesthetic in the previous plot:
ggplot(diamonds, aes(x = carat, y = price, col = price)) +
  geom_point(alpha = 0.2, shape = 21) +
  scale_color_distiller(palette = "YlOrRd")
It looks almost the same as before but if you look closely, the points have no fill in the interior. The code below makes the fill black:
ggplot(diamonds, aes(x = carat, y = price, col = price)) +
  geom_point(alpha = 0.2, shape = 21, fill = "black") +
  scale_color_distiller(palette = "YlOrRd")
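Going one step further, with shape 21 the interior can itself be mapped to a variable. The sketch below maps fill to price and keeps a fixed outline; the grey outline color is an arbitrary choice and p_fill21 is an illustrative name.

```r
library(ggplot2)

p_fill21 <- ggplot(diamonds, aes(x = carat, y = price, fill = price)) +
  # Outline is a fixed grey; interior follows the fill scale
  geom_point(shape = 21, colour = "grey30", alpha = 0.2) +
  scale_fill_distiller(palette = "YlOrRd")

p_fill21
```

Since fill is now the mapped aesthetic, the matching scale function is scale_fill_distiller() rather than scale_color_distiller().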
Sometimes we may want to zoom in on a particular part of the plot. For example, look at the scatterplot of carat vs. z:
ggplot(diamonds, aes(x = z, y = carat)) +
  geom_point(alpha = 0.2)
While the default plot shows us all the data, most of the plot is wasted space to accommodate a single outlier. The following code allows us to define the limits of the x-axis (only from 0 to 8.5):
ggplot(diamonds, aes(x = z, y = carat)) +
  geom_point(alpha = 0.2) +
  scale_x_continuous(limits = c(0, 8.5))
## Warning: Removed 1 rows containing missing values (geom_point).
R helpfully warns us that the one outlier was removed before plotting.
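We can confirm which row was dropped by checking the data directly. This is a small base R sketch; outside is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset

# Rows whose z value falls outside the x-axis limits used above
outside <- diamonds$z < 0 | diamonds$z > 8.5

sum(outside)                        # number of rows outside the limits
diamonds[outside, c("carat", "z")]  # the offending row(s)
```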
Instead of using scale_x_continuous(), we could also use coord_cartesian() to achieve the same effect:
ggplot(diamonds, aes(x = z, y = carat)) +
  geom_point(alpha = 0.2) +
  coord_cartesian(xlim = c(0, 8.5))
Notice that in this case, R does not warn us about the outlier. That is because the two functions work differently. With scale_x_continuous(), R removes all points outside the limits, then plots the remaining points. With coord_cartesian(), R plots all the points, then zooms in on the specified range. This difference might not seem like a big deal but it can matter in some cases. For example, the code below draws a jagged line:
n <- 15
df <- data.frame(x = cos(2 * pi * 1:n / n),
                 y = sin(2 * pi * 1:n / n))
ggplot(df, aes(x = x, y = y)) +
  geom_line()
If we only want to zoom in on the part above the x-axis, coord_cartesian() does the right thing:
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1))
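scale_y_continuous(), on the other hand, does something funky: the points below the x-axis are removed before the line is drawn, so the line is built from the remaining points only. That’s probably not what we want in this case.
ggplot(df, aes(x = x, y = y)) +
  geom_line() +
  scale_y_continuous(limits = c(0, 1))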
## Warning: Removed 2 rows containing missing values (geom_path).
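To see how much data the scale limits throw away here, we can count the points that fall outside the range ourselves. The sketch below rebuilds the same data frame so that it runs on its own; outside is an illustrative name.

```r
n <- 15
df <- data.frame(x = cos(2 * pi * 1:n / n),
                 y = sin(2 * pi * 1:n / n))

# Points outside the y limits c(0, 1) used above
outside <- df$y < 0 | df$y > 1

sum(outside)    # how many points the scale limits discard
df[outside, ]   # the discarded points
```

This count is larger than the number in the warning. That appears to be because the line geom only reports missing values trimmed from the ends of the line, while gaps in the middle are shown as breaks without a warning.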
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 plyr_1.8.4 RColorBrewer_1.1-2
## [4] pillar_1.4.2 compiler_3.6.1 tools_3.6.1
## [7] zeallot_0.1.0 digest_0.6.20 viridisLite_0.3.0
## [10] lattice_0.20-38 nlme_3.1-140 evaluate_0.14
## [13] tibble_2.1.3 gtable_0.3.0 mgcv_1.8-28
## [16] pkgconfig_2.0.2 rlang_0.4.0 Matrix_1.2-17
## [19] cli_1.1.0 yaml_2.2.0 xfun_0.9
## [22] withr_2.1.2 dplyr_0.8.3 stringr_1.4.0
## [25] knitr_1.24 vctrs_0.2.0 grid_3.6.1
## [28] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
## [31] fansi_0.4.0 rmarkdown_1.15 reshape2_1.4.3
## [34] purrr_0.3.2 magrittr_1.5 splines_3.6.1
## [37] scales_1.0.0 backports_1.1.4 htmltools_0.3.6
## [40] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
## [43] utf8_1.1.4 stringi_1.4.3 lazyeval_0.2.2
## [46] munsell_0.5.0 crayon_1.3.4